Automated Text Classification in the DMOZ Hierarchy

Author

  • Lachlan Henderson
Abstract

The growth in the availability of on-line digital text documents has prompted considerable interest in Information Retrieval and Text Classification. Automating the management of this wealth of textual data is becoming increasingly important as new material continues to appear at a substantial rate. The Open Directory Project (ODP), also known as DMOZ, is an on-line service that provides a searchable and browsable, hierarchically organised directory to facilitate access to the Internet's resources. This resource is of considerable use for the construction of intelligent systems for on-line content management. This report investigates the utility of the publicly available Open Directory Project data for the classification of World Wide Web (WWW) text documents. The resource is sampled and a range of algorithms is applied to the task, namely Support Vector Machines (SVM), Multi-class Rocchio (Centroid), k-Nearest Neighbour, and Naïve Bayes (NB). The theoretical and implementation details of the four text classification systems are discussed. Results from the tuning and performance of these algorithms are analysed and compared with published results. Related work from the areas of both text classification and classification in general is surveyed. Some of the unique issues of large-scale multi-class text classification are identified and analysed.


Similar resources

CatS: A Classification-Powered Meta-Search Engine

CatS is a meta-search engine that utilizes text classification techniques to improve the presentation of search results. After posting a query, the user is offered an opportunity to refine the results by browsing through a category tree derived from the dmoz Open Directory topic hierarchy. This paper describes some key aspects of the system (including HTML parsing, classification and displaying...


Improving the Quality of Text Classification Using a Two-Level Classifier Committee

Automated text classification has gained particular importance due to the increasing availability of documents in digital form and the ensuing need to organize them. Although this problem belongs to the Information Retrieval (IR) field, the dominant approach is based on machine learning techniques. Approaches based on classifier committees have shown better performance than others. I...


A Method for Keyword Extraction and Term Weighting to Improve Persian Text Classification

Due to the ever-increasing expansion of information and the huge volume of unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, automated extraction seems inevitable. This research uses a thesaurus (a structured word net) to extract keywords automatically. A...


Document Representations for Classification of Short Web-Page Descriptions

Motivated by applying Text Categorization to classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers – Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from th...


Interactions Between Document Representation and Feature Selection in Text Categorization

Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule taking into account just one document representation. Only relatively recently did detailed studies on the impact of various document representations step into the spotlight, showing that there may be statistically significant differe...




Publication date: 2009